Introduction to data science in R, Lesson 3:
Functinos and for loops


Brian S. Evans, Ph.D.
Migratory Bird Center
Smithsonian Conservation Biology Institute


In this lesson we will explore for loops and functions. A for loop is a block of code that allows you to repeat an operation a set number of iterations. A function is a set of code that allows you to create or modify objects in a repeatable way. As data analysis often includes many repeated tasks, for loops and user-defined functions can save considerable time, shorten R scripts, and make your scripts more legible. Understanding how they work is crucial to developing good coding practices.


Setup for the lesson:

For this lesson, we will use three example data frames: Iris sepal and petal measurements collected by botanist Edgar Anderson in the 1920’s, bird traits collected by Wilman et al. (2014), and bird point count observations from Washington, DC (Evans et al. 2016).

Open R Studio and run the following lines of code to load the data into R:


#=================================================================================*
# ---- set up ----
#=================================================================================*

# Load libraries

library(RCurl)

library(tidyverse)

# Note: If you have yet to install these library, please do so with: 
# install.packages('rCurl') ; install.packages('tidyverse')

# Provide the web addresses of the files (note: iris is preloaded example data):

habitsURL <- getURL('https://raw.githubusercontent.com/bsevansunc/workshop_languageOfR/master/birdHabits.csv')

countsURL <- getURL('https://raw.githubusercontent.com/bsevansunc/workshop_languageOfR/master/birdCounts.csv')

# Read in the data:

birdHabits <- tbl_df(read.csv(text = habitsURL))

birdCounts <- tbl_df(read.csv(text = countsURL))

Take a moment to explore the data:


# Explore the data (birdHabits example):

head(birdHabits)

dim(birdHabits)

attributes(birdHabits)

str(birdHabits)

summary(birdHabits)

Have a look at the built-in iris dataset as well:


# Explore the iris data:

head(iris)

dim(iris)

attributes(iris)

str(iris)

summary(iris)

For the iris dataset, I think it’s best to do some familiar cleaning steps:


# Clean up iris for analysis:

irisTbl <- tbl_df(iris)

names(irisTbl) <-
  c('sepalLength',
    'sepalWidth',
    'petalLength',
    'petalWidth',
    'species')

Functions

In the previous lessons, we have worked with several functions (e.g., “c”, “mean”). Functions allow you to simplify complex tasks, which is especially useful if you need to run a task multiple times. Program R contains many functions and many more still are created by R users and provided to the R community as collections of functions known as packages (or libraries). As datasets and data handling needs are often distinct, relying exclusively on built-in and community-defined functions may be limiting.

Why would you write your own functions? Learning how to create your own functions, or customize existing functions, provides you with the flexibility to solve unique problems, shorten your script, and make your R code more legible for others. Writing your own functions is easy, as long as you follow the correct syntax. The basic structure is:


theNameOfMyFunction <- function(objectToComputeFunctionFor) {
    What you would like to happen when you run your function
}


Let’s illustrate writing functions with a pretty silly example – a function that will add 1 to any value. Below, we assign a name to the function, addOneFun <-, tell R that the code in the enclosing curly brackets is a function of some object x (function(x)), and provide the argument the function will be evaluated and returned (x + 1).


# First function:

addOneFun <- function(x){
    x+1
}

# Testing the function on a numeric value:

42+1

addOneFun(42)

# Testing the function on a vector of numeric values:

v <- c(1,1,2,3,5)

v + 1

addOneFun(v)


If you are doing a lot of data querying, writing your own query functions can be a great way to make your code more legible. Let’s use the birdCounts data to explore writing custom query functions. For example, perhaps you frequently need to query the number of individuals of a given species at each site:


# Query by species function:

speciesSubset <- function(spp){
  birdCounts[birdCounts$species == spp, ]
}

# Test function:

birdCounts[birdCounts$species == 'grca', ]

speciesSubset('grca')

It’s best to write your functions as general as possible. For example, the species subset function above is great if you are going to query just the birdCounts data frame. If you also intend to query the birdHabits frame by species, you may want to make the data frame one of the arguments.


# Query by species function, generalized:

speciesSubset <- function(dfIn, spp){
  dfIn[dfIn$species == spp, ]
}

# Test function, birdCounts:

birdCounts[birdCounts$species == 'grca', ]

speciesSubset(birdCounts, 'grca')

# Test function, birdHabits:

birdHabits[birdHabits$species == 'grca', ]

speciesSubset(birdHabits, 'grca')

Exercise One:


  1. Write a function to subset the birdHabits data frame to just ground foraging birds.
  2. Write a function that will calculate the total number of individuals observed of a given species in the birdCounts data frame(i.e., the sum of count for a data frame subset by species)




Functions can sometimes get pretty long and include multiple objects. When this is the case, it’s good practice to include return to illustrate which object is being returned. Let’s write a function in multiple steps that calculates the average (mean) number of individuals of a given species observed on a point count.

# Query by species function, generalized:

meanSpeciesCounts <- function(spp){
  # Number of unique site values:
  nSites <- length(unique(birdCounts$site))
  # Subset birdCounts to the species of interest:
  birdCounts_sppSubset <- birdCounts[birdCounts$species == spp, ]
  # Calculate the total number of birds observed:
  nBirds <- sum(birdCounts_sppSubset$count)
  # Return mean number of birds per site:
  return(nBirds/nSites)
}

# What is the average number of observed catbirds?

meanSpeciesCounts(birdCounts, 'grca')

Functions can also be nested. In other words, functions can include user-defined functions written inside the function body. We have already done this quite a bit with functions in base R. For example, the length function above is nested inside our user-defined function meanSpeciesCounts. Let’s try to nest our speciesSubset function inside the meanSpeciesCounts function above.


# Query by species function, generalized:

meanSpeciesCounts <- function(dfIn, spp){
  # Number of unique site values:
  nSites <- length(unique(dfIn$site))
  # Calculate the total number of birds observed:
  nBirds <- sum(speciesSubset(dfIn, spp)$count)
  # Return mean number of birds per site:
  return(nBirds/nSites)
}

# What is the average number of observed catbirds?

meanSpeciesCounts(birdCounts, 'grca')


Exercise Two:


  1. Using the birdHabits data frame, write a function to count the number of species in a given diet and foraging guild.
  2. The mathematical formula for standard error is provided below. Convert this to an R function (Note: the function for standard deviation is sd and the function for square root is sqrt):
  3. \[StdErr (x) = \frac{StDev(x)}{\sqrt{n}}\]




picture of the flower Iris setosa

For loops


Why would you use for loops? Let’s look at a common example. You have been asked to calculate the mean petal length of three Iris species: Iris setosa, Iris versicolor, and Iris virginica. (coded as the factor levels setosa, versicolor, and virginica in the species field of the irisTbl data frame). Using the mean function and tools that we have addressed thus far, use the following to calculate the mean petal length for each species:


# Filter irisTbl to setosa:

irisTbl[irisTbl$species == 'setosa', ]

# Extract the petalLength field (column):

irisTbl[irisTbl$species == 'setosa', ]$petalLength

# Calculate the mean of petal lengths:

mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)

Exercise One:


Calculate the mean petal length of each of the Iris species using matrix notation (as above) and a custom function.



The code you generated above is very repetitive – the only change in each of the lines you should have created above was the name of the species. Writing code like this makes your scripts unnecessarily long and prone to user input error. For loops should be used whenever a chunk of code is repeated more than twice.

We’ll return to the Iris example a bit later, but for now let’s start with a very simple for loop.



Writing proper for loops requires following these three steps:


  1. Output: Always define an object for storing output (e.g., an empty vector, matrix, or list)
  2. Sequence: The locations for which the loop will run
  3. Body: This is the instructions for what will occur during each iteration of the loop


Let’s use some example data. We’ll construct a vector, v using a set of five numbers.


v <- c(1,1,2,3,5)
v


We would like to modify the values in vector v by adding one to each value. This might be written mathematically as:

\[V_{new, i} = V_{i} + 1\]


Recall that value v[i] is equal to the value at position i in vector v. Let’s take a look at the value of v at position 3:


v[3]

i <- 3

v[i]

v[3] == v[i]


If we want to calculate v + 1 at position 3, we would write:


v[3] + 1

For loop, output:


For loop development begins by defining an object of a given length that your for loop will write to. This step is very important – your for loop will run much, much slower if you do not do so!


Vector objects are defined as follows:


vNew <- vector('numeric', length = length(v))

str(vNew)

The first argument of the vector function is the type of object you would like to create. The next argument sets the length of the created object to be the equivalent to that of v.


Values of vector vNew will be replaced location-by-location. For example, let’s compare the initial value of v with the resultant value of v + 1 at position 3:


i <- 3

v[i]

vNew[i] <- v[i] + 1

vNew[i]

v[i] + 1 == vNew[i]


For loop, sequence


The utility of the for loop is that we can calculate the above for each position (i) in vector v. This is done by setting the “for loop sequence” statement which defines the locations over which the for loop will be calculated. The for loop sequence for locations one through five is written as (DO NOT RUN):


# for(i in 1:5)


The above can be translated as “for position i in positions one through five”.

To make our code as flexible as possible, we generally do not want to have to directly type in the stopping point of the for loop. We can use 1:length(v) or the function seq_along to specify the for loop address. Run the following and note the behavior:


v

1:5

1:length(v)

seq_along(v)


The for loop sequence statement can then be written as (DO NOT RUN):


# Example for loop sequence statements:

# for(i in 1:length(v))
  
# for(i in seq_along(v))


The above statements are more flexible than providing a numeric stopping point for your for loop.

For loop, body

The for loop sequence statement is followed by the “body” statement that tells R what to do during each iteration of the loop. The body associated with our “add one” formula is (DO NOT RUN):


vNew[i] <- v[i] + 1


If your for loop spans multiple lines, place the body within curly brackets, for example (DO NOT RUN):


# for(i in 1:length(x)){
#   body
# }


The complete for loop


Putting together our steps of: 1) Creating an output object, 2) The for loop sequence statement, and 3) The body statement, our completed for loop is written as follows (RUN THIS ONE):


vNew <- numeric(length = length(v))

for(i in seq_along(v)){
 vNew[i] <- v[i] + 1
}


Take a look at the output and compare it with the values of v:


vNew

v

vNew == v


We specify the length of the vector to provide R with stopping rules – without this for loops can become very memory hungry when running over large datasets


The following two sections of code are equivalent, but the latter much easier to read. As writing code is both a tool and a method of communication, you should ensure that your code is as readable as possible.


for(i in 1:length(v)) vNew[i] <- v[i] + 1

for(i in 1:length(v)){
  vNew[i] <- v[i] + 1
}

When writing for loops, it is necessary to ensure that the loop is doing what you expect it to do. A simple way to ensure that this is the case is to specify i and run the instructions on just that value. For example, to observe the behavior of the for loop at position 3:


i = 3

vNew[i] <- v[i] + 1

vNew[i]

v[i]

Exercise Two:


  1. Use a for loop to test whether each species of bird in the birdHabits data frame is an omnivore (logical test).
  2. Calculate the total number of records associated with omnivorous birds.




For loop application, split-apply-combine


You may have noticed that none of the operations we completed in the previous section actually required for loops (for example, v + 1 is calculated for each value of v by default). For loops are predominantly used to split-apply-combine data. In data science, split-apply-combine problems are those that relate to situations in which you seek to split a dataset into multiple parts, apply a function to each part, and combine the resulting parts. For loops can be a great tool for split-apply-combine problems.

The Iris example at the start of this lesson, however, highlights their utility for summarizing data. Calculating the mean for each species without a for loop required the following code:


# Mean petal lengths of Iris species without a for loop:

mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)

mean(irisTbl[irisTbl$species == 'versicolor', ]$petalLength)

mean(irisTbl[irisTbl$species == 'virginica', ]$petalLength)


We can use a for loop to calculate this value across species. To do so, we first need to create a vector of species:


# Make a vector of species to loop across:

irisSpecies <- levels(irisTbl$species)

irisSpecies


Next, we’ll create an empty vector to store our output:


# For loop output statement:

petalLengths <- vector('numeric',length = length(irisSpecies))

petalLengths


The vector of Iris species will define the bounds of our for loop sequence (DO NOT RUN):


# For loop sequence: 

# for(i in seq_along(irisSpecies))


Our for loop body is constructed as:


# Exploring the construction of the for loop body:

i <- 3

irisSpecies[i]

irisTbl[irisTbl$species == irisSpecies[i], ]

irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength

mean(irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength)

# The for loop body:

petalLengths[i] <- mean(irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength)


Putting it all together, the for loop would be written as:


# Make a vector of species to loop across:

irisSpecies <- levels(irisTbl$species)

# For loop output statement:

petalLengths <- vector('numeric',length = length(irisSpecies))

# For loop:

for(i in seq_along(irisSpecies)){
  petalLengths[i] <- mean(irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength)

}


We can turn our result into a tidy tibble data frame, using the data_frame function:


petalLengthFrame <- data_frame(species = irisSpecies, count = petalLengths)

petalLengthFrame

Exercise Three:


Use a for loop and the birdHabits data frame to calculate the number species in each diet guild.





To illustrate how this would work, we will use birdCounts data frame and calculate the total number of omnivorous birds that were observed at each of the sites. Take a moment to explore the data, we will need to use both data frames:


# Explore the bird count data:

head(birdCounts)

str(birdCounts)

# Explore the bird trait data:

head(birdHabits)

str(birdHabits)

Our first goal will be to get a vector of birds that are ground foragers.


# Explore the bird count data:

omnivores <- birdHabits[birdHabits$diet == 'omnivore',]$species

Split: To evaluate the number of omnivores per site, we will split the data into individual sites. To do so, we generate a vector of unique sites and query the data frame by the site at a given position in the site vector.


# Generate a vector of unique sites:

sites <- unique(birdCounts$site)

# Site at position i:

i <- 3

sites[i]

# Subset data:

birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]

birdCounts_siteSubset

Apply: We then apply a function to each individual part. In this case, we will use %in% to extract only records associated with omnivores and sum the count field.


# Just a vector of counts:

countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count

# Get total number of omnivores at the site:

nOmnivores <- sum(countVector)

Combine: Output may be combined together as we have with previous for loop statements in this lesson. In the code below, I have combined the sites and nOmnivore vectors into a tibble data frame:


# Generate a vector of unique sites:

sites <- unique(birdCounts$site)

outVector <- vector('numeric', length = length(unique(sites)))

for(i in seq_along(sites)){
  birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
  countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
  outVector[i] <- sum(countVector)
}

# Combine:

data_frame(site = sites, nOmnivores = outVector)

An alternative to combining vectors that I often find useful is combining a list of data frames using the tidyverse function bind_rows:


# Generate a vector of unique sites:

sites <- unique(birdCounts$site)

outList <- vector('list', length = length(unique(sites)))

for(i in seq_along(sites)){
  birdCounts_siteSubset <- birdCounts[birdCounts$site == sites[i],]
  countVector <- birdCounts_siteSubset[birdCounts_siteSubset$species %in% omnivores, ]$count
  outList[[i]] <- data_frame(site = sites[i], nOmnivores = sum(countVector))
}

# Combine:

bind_rows(outList)

For loop application, simulation


We may also be interested in using a for loop to generate a vector of numbers based on some mathematical operation. For example, consider we have a value, 10, and want to calculate the mathematical expression:

\[n_t = 2(n_{t-1})\]

Let’s start by creating a vector of length 5 for our output:

# For loop output statement:


n <- vector('numeric', length = 5)

n

We must first start with a “seed” value for our vector. This is the initial value of vector n.


# Setting the seed value:

n[1] <- 10

n


Let’s calculate the first five values in this series. Because our for loop starts with our seed value, we are only interested in the second through fifth positions of this vector. Thus, our for loop sequence statement is (DO NOT RUN):


# For loop sequence:

# for(i in 2:length(n))


For each iteration, the following body statement will be evaluated (the example is at position 2):

# Exploring the construction of the for loop body:

i <- 2

n[i]

n[i-1]

n[i] <- 2*n[i-1]

n

And our complete for loop statement becomes:


# Output:

n <- vector('numeric', length = 5)

# Seed:

n <- 10

# For loop:

for(i in 2:5){
 n[i] = n*v[i-1]
}

Exercise Four:


One of my favorite for loops was created by Leonardo Bonacci (Fibonacci). He created the first known population model, from which the famous Fibonacci number series was created. He described a population (N) at time t as the sum of the population at the previous time step plus the time step before that:

\[N_t = N_{t-1} + N_{t-2}\]
  1. Use the combine function to create a seed vector of two values, zero and one.
  2. Use the formula above and your seed vector to generate the first 20 numbers of the Fibonacci number sequence. Hint: The for loop initialization will begin at third position – it will NOT include all of 1:20!



OPTIONAL CHALLENGE!

You can use for loops and functions to examine fundamental theoretical models in ecology. (Only) If you’re feeling super comfortable with for loops and functions, try out these additional exercises:


  1. The species-area relationship of Arrhenius (1921) states that the number of species (S) is dependent on the rate of increase in species with increasing area (z) multiplied by a constant (c) that represents the number of species that would be present in one spatial unit. This model is summarized as: \[S = c \cdot\ A^z\]
  1. Convert the species-area relationship formula to an R function, with c, A, and z set as unknowns.
  2. Create a vector named areaVector of areas values ranging from 1 to 10.
  3. Using 15 for the value of c and a z of 0.75, use a for loop to calculate the expected species richness for your areaVector. Note: You don’t really need to use a for loop for this, but it’s good practice!

  1. The exponential population growth model describes the growth of a population without resource limitations. This model describes the size of a population (\(N_t\)) at a given time unit as a function of the per capita growth rate of the population (r) and the size of the population at time t. This model is summarized as: \[N_t = N_0 \cdot e^{r \cdot t}\]
  1. Convert the exponential population growth model to an R function with r, \(N_0\), and t as unknowns.
  2. With an initial population size of 25 and a per capita growth rate (r) of 0.45, use the exponential population growth rate function and a for loop to calculate the size of the population over 10 generations.

  1. The logistic population growth model (Verhulst) states that the growth of the population will be constrained by the population carrying capacity (K) as a function of resource limitations. This model is summarized as (Note: the remaining variables are as above): \[N_t = \dfrac{K}{\mathstrut{1 + \bigg[ \dfrac{K-N_0}{N_0} \bigg] \cdot e^{-r \cdot t}}}\]
  1. Convert the logistic population growth model to an R function with r, \(N_0\), t, and K as unknowns.
  2. With an initial population size of 25, a per capita growth rate (r) of 0.45, and a carrying capacity of 500 individuals, use the logistic population growth rate function and a for loop to calculate the size of the population over 10 generations.